Methods for the Classification of Data from Open-Ended Questions in Surveys

Disputation
16 April 2024

Camille Landesvatter

University of Mannheim

Research Questions and Motivation

Which methods can we use to classify data from open-ended survey questions?
Can we leverage these methods to make empirical contributions to substantive questions?

Motivation:

1️⃣ Increase in methods to collect natural language (e.g., smartphone surveys with voice technologies) requires the evaluation of available methods.

2️⃣ Special structure of open-ended survey answers (e.g., shortness, lack of context) requires the testing of ML methods for the survey context, e.g., word embeddings.

Methods for Analyzing Data from Open-Ended Questions

Table 1. Overview of methods for classifying open-ended survey responses. Source: Own depiction.

Overview of Studies

  • Study 1: How valid are trust survey measures? New insights from open-ended probing data and supervised machine learning
  • Study 2: Open-ended survey questions: A comparison of information content in text and audio response formats
  • Study 3: Asking Why: Is there an Affective Component of Political Trust Ratings in Surveys?

How valid are trust survey measures? New insights from open-ended probing data and supervised machine learning

Landesvatter, C., & Bauer, P. C. (2024). How Valid Are Trust Survey Measures? New Insights From Open-Ended Probing Data and Supervised Machine Learning. Sociological Methods & Research, 0(0). https://doi.org/10.1177/00491241241234871

Study 1: Characteristics

  • Background: ongoing debates about which type of trust survey researchers are measuring with traditional survey items (i.e., equivalence debate, cf. Bauer & Freitag 2018)

  • Research Question: How valid are traditional trust survey measures?

  • Questionnaire Design: 5 open-ended questions per respondent, block-randomized order

  • Data: U.S. non-probability sample; n = 1,500 with 7,497 open answers

Study 1: Methodology

Figure 1: Supervised Classification for a Trust Question.

Supervised classification approach:

    1. manual labeling of randomly sampled documents (n=[1,000/1,500])
    2. fine-tuning the weights of two BERT models, using the manually coded data as training data, to classify the remaining n=[6,500/6,000]
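The two-step workflow (manually label a subset, train on it, classify the remainder) can be sketched in miniature. The study fine-tuned BERT models; the toy bag-of-words Naive Bayes classifier and example probing answers below are illustrative stand-ins for that pipeline, not a reproduction of it:

```python
from collections import Counter, defaultdict
import math

def train(labeled):
    """Collect per-label word counts and label frequencies from the coded subset."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in labeled:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def predict(text, word_counts, label_counts, vocab):
    """Naive Bayes with add-one smoothing over the labeled vocabulary."""
    best, best_score = None, float("-inf")
    total = sum(label_counts.values())
    for label in label_counts:
        score = math.log(label_counts[label] / total)
        n_words = sum(word_counts[label].values())
        for tok in text.lower().split():
            score += math.log((word_counts[label][tok] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

# Hypothetical coded probing answers (stand-in for the manual labeling step)
labeled = [
    ("people i know personally like friends", "known"),
    ("my neighbor and my family", "known"),
    ("strangers on the street", "unknown"),
    ("people i have never met", "unknown"),
]
model = train(labeled)
print(predict("a stranger i never met", *model))  # → "unknown"
```

The same split-label-train-predict structure carries over when the toy classifier is swapped for a fine-tuned transformer.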

Study 1: Results

| ID | Measure | Trust | Probing Answer | Associations (known others) | Associations (sentiment) |
|---|---|---|---|---|---|
| 123 | Most people | 0.33 | I was thinking of people I don’t know personally. | 0 (No) | 0 (neutral/positive) |
| 3139 | Most people | 0.17 | Tourists that come to our little village. I tend to be very wary of them. | 0 (No) | 1 (negative) |
| 2980 | Stranger | 0 | No one in particular, but I don’t think I could trust anyone ever again. | 0 (No) | 1 (negative) |
| 4286 | Watching a loved one | 0 | A former neighbor of mine who was a single father with a son close to my son’s age. | 1 (Yes) | 0 (neutral/positive) |

Table 2: Illustration of exemplary data. Note: n=7,497.

Open-ended survey questions: A comparison of information content in text and audio response formats

Landesvatter, C., & Bauer, P. C. (February 2024). Open-ended survey questions: A comparison of information content in text and audio response formats. Working Paper submitted to Public Opinion Quarterly.

Study 2: Characteristics

  • Background: requests for spoken answers are assumed to trigger more open narration with more intuitive and spontaneous answers (e.g., Gavras et al. 2022)

  • Research Question: Are there differences in information content between responses given in voice and text formats?

  • Experimental Design: random assignment into either the text or voice condition

Study 2: Methodology

  • Operationalization of information content in open answers via application of measures from information theory and machine learning

    • response length, number of topics, response entropy
  • Questionnaire Design: 9 open-ended questions per respondent, block-randomized order

  • Data: U.S. non-probability sample; n = 1,461 with n_text = 800 and n_audio = 661

    • average item non-response rate text: 1%
    • average item non-response rate audio: 53%
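Of the three information-content measures, response entropy is the most compact to sketch: the Shannon entropy of the word distribution within one open answer. A minimal version assuming simple whitespace tokenization (the study's exact preprocessing may differ):

```python
from collections import Counter
import math

def response_entropy(text):
    """Shannon entropy (in bits) of the word distribution in one open answer.

    Higher values indicate a more even spread over more distinct words,
    i.e., more information content under this measure.
    """
    tokens = text.lower().split()
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

short = "i don't know"
longer = "i was thinking of tourists that come to our little village every summer"
print(response_entropy(short) < response_entropy(longer))  # → True
```

An answer repeating a single word has entropy 0; an answer of k distinct words has entropy log2(k), which is why the measure is typically read alongside raw response length.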

Study 2: Results

Figure 2: Information Content Measures across Questions.
Note. CIs are 95%, n_vote-choice: 830 (audio: 225, text: 605), n_future-children: 1,337 (audio: 389, text: 748)

Asking Why: Is there an Affective Component of Political Trust Ratings in Surveys?

Landesvatter, C., & Bauer, P. C. (March 2024). Asking Why: Is there an Affective Component of Political Trust Ratings in Surveys? Working Paper submitted to American Political Science Review.

Study 3: Characteristics

  • Background: the conventional notion that trust originates from informed, rational, and consequential judgments is challenged by the idea of an “affect-based” form of (political) trust (e.g., Theiss-Morse and Barton 2017)

  • Research Question: Are individual trust judgments in surveys driven by affective rationales?

  • Questionnaire Design: voice condition only

  • Data: U.S. non-probability sample; n = 1,474 with 491 audio open answers

Study 3: Methodology

Figure 3: Methods for Sentiment and Emotion Analysis.
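The simplest method in this family is the dictionary approach: count matches against positive and negative word lists and compare. A minimal sketch, with illustrative word lists rather than the lexicon used in the study:

```python
# Illustrative word lists (not the study's actual sentiment lexicon)
NEGATIVE = {"corrupt", "liars", "angry", "broken", "distrust", "never"}
POSITIVE = {"honest", "fair", "hope", "good", "trust"}

def dictionary_sentiment(text):
    """Classify one open answer by counting lexicon hits per polarity."""
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "negative" if score < 0 else "neutral/positive"

print(dictionary_sentiment("politicians are corrupt liars"))  # → "negative"
```

Its transparency makes it a natural baseline before moving to deep-learning models, which is the progression the later slides describe for Study 3.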

Study 3: Results

Figure 4: Emotion Recognition for Speech Data with SpeechBrain.
Note. n_neutral=408, n_anger=44, n_sadness=18, n_happiness=21.

Summary

  • Web surveys make it possible to collect narrative answers that provide valuable insights into survey responses
    • think aloud, associations, emotions, tonal cues, additional info, etc.
  • New technologies (smartphone surveys, speech-to-text algorithms) can be used to collect such data in innovative ways
  • Analyzing natural language can inform various debates, e.g.:
    • Study 1: equivalence debate in trust research
    • Study 2: survey questionnaire design research
    • Study 3: cognitive-versus-affective debate in political trust research
    • Study 1-3: item and data quality in general (e.g., associations, information content, sentiment, emotions)

Machine Learning and Open-ended Answers

Large language models (LLMs) facilitate the accessibility and implementation of semi-automated methods.

  • traditional semi-automated methods, e.g. supervised ML, require sufficient and high-quality training data (i.e., labeled examples)

  • surveys often don’t provide thousands of documents

  • LLMs allow less resource-intensive, domain-specific fine-tuning and remove the need to build complex systems from scratch

  • E.g., Study 1: Random Forest with 1,500 labeled examples versus BERT

Machine Learning and Open-ended Answers

  • But: consider the complexity and limited transparency of these models
    • always start with simple methods and evaluate
      • Study 1: Random Forest → BERT
      • Study 3: dictionary approach → deep learning
    • accuracy-explainability trade-off

Machine Learning and Open-ended Answers

There is an increasing number of ways to reduce manual input to a minimum.

  • Study 3: zero-shot prompting yields findings similar to those of fine-tuned pre-trained models (e.g., 80% overlap between GPT prompting and pysentimiento)
  • deciding on a suitable number of manual examples depends on various factors such as the task difficulty
    • few-shot versus zero-shot prompting
  • the less manual input, the more important the manual inspection of results becomes (e.g., Study 2: what are high-entropy documents?)
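The zero-shot versus few-shot distinction comes down to whether labeled examples are embedded in the prompt. A minimal, illustrative template (the study's actual GPT prompt is not reproduced here):

```python
def build_prompt(answer, examples=()):
    """Build a sentiment-classification prompt for an LLM.

    Zero-shot when `examples` is empty; few-shot when (text, label)
    pairs are supplied as in-context examples.
    """
    lines = ["Classify the sentiment of the survey answer as "
             "'negative' or 'neutral/positive'."]
    for text, label in examples:
        lines.append(f"Answer: {text}\nSentiment: {label}")
    lines.append(f"Answer: {answer}\nSentiment:")
    return "\n\n".join(lines)

# Zero-shot: only the instruction and the target answer
print(build_prompt("They never keep their promises."))
```

Adding even a handful of coded examples to `examples` turns this into few-shot prompting, which is where the choice of how many manual examples to supply comes in.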

Fully manual, semi-automated, or fully automated?

The final choice among these approaches depends on:

  • difficulty of the given task (e.g., general versus specific codes)

  • size of the available dataset (e.g., n, splits by experimental conditions)

  • structure of the open answers (e.g., length, amount of context → this depends on the question design)

  • the amount and state of previous research (e.g., available code schemes)

  • desired accuracy and desired transparency

  • available resources (e.g., human power, computational power (GPU), time resources)

Thank you for your attention!